50 research outputs found

    Wavelet Trees Meet Suffix Trees

    We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size $\sigma \leq n$, our method builds the wavelet tree in $O(n \log \sigma / \sqrt{\log n})$ time, improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$. As a consequence, given an array of n integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and capable of answering rank/select queries for the subranges of the array in $O(\log n / \log \log n)$ time. This is a $\log \log n$-factor improvement in query time compared to Chan and Pătraşcu and a $\sqrt{\log n}$-factor improvement in construction time compared to Brodal et al. Next, we switch to a stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies $O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in $O(\log |x|)$ time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings x and y of w, count the number of suffixes of x that are lexicographically smaller than y; and for a substring x of w and an integer k, find the k-th lexicographically smallest suffix of x. We further show that wavelet suffix trees make it possible to compute a run-length-encoded Burrows-Wheeler transform of a substring x of w in $O(s \log |x|)$ time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.
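For orientation, the kind of subrange rank/select query mentioned in this abstract can be illustrated with a textbook pointer-based wavelet tree. This is only a minimal sketch under stated assumptions: it uses the naive $O(n \log \sigma)$ construction, not the faster construction the abstract describes, and the names `WaveletTree`, `rank`, and `kth_smallest` are hypothetical, chosen for this example.

```python
class WaveletTree:
    """Textbook wavelet tree over a non-empty list of integers (naive build)."""

    def __init__(self, data, lo=None, hi=None):
        if lo is None:
            lo, hi = min(data), max(data)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        # prefix[i] = how many of the first i elements go to the left child
        self.prefix = [0]
        if lo == hi or not data:
            return
        mid = (lo + hi) // 2
        for v in data:
            self.prefix.append(self.prefix[-1] + (v <= mid))
        self.left = WaveletTree([v for v in data if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in data if v > mid], mid + 1, hi)

    def rank(self, c, i):
        """Number of occurrences of c among the first i elements."""
        node = self
        if c < node.lo or c > node.hi:
            return 0
        while node.lo != node.hi:
            if i == 0:
                return 0
            mid = (node.lo + node.hi) // 2
            if c <= mid:
                i = node.prefix[i]
                node = node.left
            else:
                i = i - node.prefix[i]
                node = node.right
        return i

    def kth_smallest(self, l, r, k):
        """k-th smallest (0-based) value in data[l:r]; requires 0 <= k < r - l."""
        node = self
        while node.lo != node.hi:
            in_left = node.prefix[r] - node.prefix[l]
            if k < in_left:
                l, r = node.prefix[l], node.prefix[r]
                node = node.left
            else:
                k -= in_left
                l, r = l - node.prefix[l], r - node.prefix[r]
                node = node.right
        return node.lo


data = [3, 1, 4, 1, 5, 9, 2, 6]
wt = WaveletTree(data)
print(wt.rank(1, 4))             # occurrences of 1 in data[:4]
print(wt.kth_smallest(2, 6, 1))  # second-smallest value in data[2:6]
```

Each query walks one root-to-leaf path, so it costs time proportional to the tree depth (about $\log \sigma$ levels here); the bounds quoted in the abstract rely on techniques this sketch does not attempt.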

    Lightweight Lempel-Ziv Parsing

    We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d ≥ 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest. Comment: 12 pages.
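To make the object being computed concrete, here is a minimal sketch of LZ77 factorization by brute force. It is not the lightweight algorithm from the abstract: it takes quadratic time or worse and is meant only to show what a factorization looks like. The function names `lz77_factorize` and `lz77_decode` and the phrase encoding (a `(position, length)` pair for a copy, a `(character, 0)` pair for a literal) are assumptions made for this example.

```python
def lz77_factorize(text):
    """Greedy LZ77 parse: each phrase is the longest match starting earlier
    in the text (overlap allowed), or a single literal character."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # every earlier starting position is a candidate
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            phrases.append((text[i], 0))  # literal: character not seen before
            i += 1
        else:
            phrases.append((best_pos, best_len))  # copy from an earlier position
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Invert lz77_factorize; copies go character by character so that
    self-overlapping phrases decode correctly."""
    out = []
    for first, length in phrases:
        if length == 0:
            out.append(first)
        else:
            for k in range(length):
                out.append(out[first + k])
    return "".join(out)


print(lz77_factorize("abababab"))
```

Highly repetitive inputs yield few phrases, which is why the abstract singles out such data as a case where parsing cost matters.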

    Longest property-preserved common factor

    In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider two fundamental string properties: square-free factors and periodic factors under two different settings, one per property. In the first setting, we are given a string x and we are asked to construct a data structure over x answering the following type of on-line queries: given string y, find a longest square-free factor common to x and y. In the second setting, we are given k strings and an integer 1 < k' ≤ k and we are asked to find a longest periodic factor common to at least k' strings. We present linear-time solutions for both settings. We anticipate that our paradigm can be extended to other string properties.
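The first setting above can be stated precisely with a brute-force sketch. This is not the linear-time solution the abstract announces and it builds no data structure over x; it only pins down the query being answered. The function names `is_square_free` and `longest_square_free_common_factor` are hypothetical.

```python
def is_square_free(s):
    """True if s contains no factor of the form uu with u non-empty."""
    n = len(s)
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                return False
    return True


def longest_square_free_common_factor(x, y):
    """Return one longest square-free string occurring in both x and y,
    or the empty string if no non-empty one exists (brute force)."""
    for length in range(min(len(x), len(y)), 0, -1):
        factors_y = {y[j:j + length] for j in range(len(y) - length + 1)}
        for i in range(len(x) - length + 1):
            factor = x[i:i + length]
            if factor in factors_y and is_square_free(factor):
                return factor
    return ""


print(longest_square_free_common_factor("abcabc", "xbcab"))
```

Lengths are tried from longest to shortest, so the first hit is a longest answer; ties are broken by the leftmost occurrence in x.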

    Association of Mitochondrial DNA Variations with Lung Cancer Risk in a Han Chinese Population from Southwestern China

    Mitochondrial DNA (mtDNA) is particularly susceptible to oxidative damage and mutation due to the high rate of reactive oxygen species (ROS) production and limited DNA-repair capacity in mitochondria. Previous studies found that an increased mtDNA copy number, which compensates for damage and is associated with cigarette smoking, is associated with lung cancer risk among heavy smokers. Given that the common and “non-pathological” mtDNA variations determine differences in oxidative phosphorylation performance and ROS production, an important determinant of lung cancer risk, we hypothesize that the mtDNA variations may play roles in lung cancer risk. To test this hypothesis, we conducted a case-control study to compare the frequencies of mtDNA haplogroups and an 822 bp mtDNA deletion between 422 lung cancer patients and 504 controls. Multivariate logistic regression analysis revealed that haplogroups D and F were related to individual lung cancer resistance (OR = 0.465, 95% CI = 0.329–0.656, p < 0.001; and OR = 0.622, 95% CI = 0.425–0.909, p = 0.014, respectively), while haplogroups G and M7 might be risk factors for lung cancer (OR = 3.924, 95% CI = 1.757–6.689, p < 0.001; and OR = 2.037, 95% CI = 1.253–3.312, p = 0.004, respectively). Additionally, multivariate logistic regression analysis revealed that cigarette smoking was a risk factor for the 822 bp mtDNA deletion. Furthermore, the increased frequencies of the mtDNA deletion in male cigarette-smoking subjects of combined cases and controls with haplogroup D indicated that haplogroup D might be susceptible to DNA damage from external ROS caused by heavy cigarette smoking.

    Origin and Post-Glacial Dispersal of Mitochondrial DNA Haplogroups C and D in Northern Asia

    More than half of the northern Asian pool of human mitochondrial DNA (mtDNA) is fragmented into a number of subclades of haplogroups C and D, two of the most frequent haplogroups throughout northern, eastern, and central Asia and America. While there has been considerable recent progress in studying mitochondrial variation in eastern Asia and America at the complete genome resolution, little comparable data is available for regions such as southern Siberia, the area where most northern Asian haplogroups, including C and D, likely diversified. This gap in our knowledge poses a serious barrier to progress in understanding the demographic pre-history of northern Eurasia in general. Here we describe the phylogeography of haplogroups C and D in the populations of northern and eastern Asia. We have analyzed 770 samples from haplogroups C and D (174 and 596, respectively) at high resolution, including 182 novel complete mtDNA sequences representing haplogroups C and D (83 and 99, respectively). The present-day variation of haplogroups C and D suggests that these mtDNA clades expanded before the Last Glacial Maximum (LGM), with their oldest lineages being present in eastern Asia. Unlike in eastern Asia, most of the northern Asian variants of haplogroups C and D began their expansion after the LGM, thus pointing to post-glacial re-colonization of northern Asia. Our results show that both haplogroups were involved in migrations from eastern Asia and southern Siberia to eastern and northeastern Europe, likely during the middle Holocene.

    The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers

    The Koreans are generally considered a northeast Asian group because of their geographical location. However, recent findings from Y chromosome studies showed that the Korean population contains lineages from both southern and northern parts of East Asia. To understand the genetic history and relationships of Korea more fully, additional data and analyses are necessary. We analyzed mitochondrial DNA (mtDNA) sequence variation in the hypervariable segments I and II (HVS-I and HVS-II) and haplogroup-specific mutations in coding regions in 445 individuals from seven east Asian populations (Korean, Korean-Chinese, Mongolian, Manchurian, Han (Beijing), Vietnamese and Thai). In addition, published mtDNA haplogroup data (N = 3307), mtDNA HVS-I sequences (N = 2313), Y chromosome haplogroup data (N = 1697) and Y chromosome STR data (N = 2713) were analyzed to elucidate the genetic structure of East Asian populations. All the mtDNA profiles studied here were classified into subsets of haplogroups common in East Asia, with just two exceptions. In general, the Korean mtDNA profiles revealed similarities to other northeastern Asian populations through analysis of individual haplogroup distributions, genetic distances between populations or an analysis of molecular variance, although a minor southern contribution was also suggested. Reanalysis of Y-chromosomal data confirmed both the overall similarity to other northeastern populations and a larger paternal contribution from southeastern populations. The present work provides evidence that the peopling of Korea can be seen as a complex process, interpreted as an early northern Asian settlement with at least one subsequent male-biased southern-to-northern migration, possibly associated with the spread of rice agriculture.

    Archaeological Support for the Three-Stage Expansion of Modern Humans across Northeastern Eurasia and into the Americas

    Background: Understanding the dynamics of the human range expansion across northeastern Eurasia during the late Pleistocene is central to establishing empirical temporal constraints on the colonization of the Americas [1]. Opinions vary widely on how and when the Americas were colonized, with advocates supporting either a pre-[2] or post-[1], [3], [4], [5], [6] last glacial maximum (LGM) colonization, via either a land bridge across Beringia [3], [4], [5], a sea-faring Pacific Rim coastal route [1], [3], a trans-Arctic route [4], or a trans-Atlantic oceanic route [5]. Here we analyze a large sample of radiocarbon dates from the northeast Eurasian Upper Paleolithic to identify the origin of this expansion, and estimate the velocity of the colonization wave as it moved across northern Eurasia and into the Americas. Methodology/Principal Findings: We use diffusion models [6], [7] to quantify these dynamics. Our results show the expansion originated in the Altai region of southern Siberia ~46 kBP, and from there expanded across northern Eurasia at an average velocity of 0.16 km per year. However, the movement of the colonizing wave was not continuous but underwent three distinct phases: 1) an initial expansion from 47-32k calBP; 2) a hiatus from ~32-16k calBP; and 3) a second expansion after the LGM ~16k calBP. These results provide archaeological support for the recently proposed three-stage model of the colonization of the Americas [8], [9]. Our results falsify the hypothesis of a pre-LGM terrestrial colonization of the Americas and we discuss the importance of these empirical results in the light of alternative models. Conclusions/Significance: Our results demonstrate that the radiocarbon record of Upper Paleolithic northeastern Eurasia supports a post-LGM terrestrial colonization of the Americas, falsifying the proposed pre-LGM terrestrial colonization of the Americas. We show that this expansion was not a simple process, but proceeded in three phases, consistent with genetic data, largely in response to the variable climatic conditions of late Pleistocene northeast Eurasia. Further, the constraints imposed by the spatiotemporal gradient in the empirical radiocarbon record across this entire region suggest that North America cannot have been colonized much before the existing Clovis radiocarbon record suggests.

    Beringian Standstill and Spread of Native American Founders

    Native Americans derive from a small number of Asian founders who likely arrived in the Americas via Beringia. However, additional details about the initial colonization of the Americas remain unclear. To investigate the pioneering phase in the Americas we analyzed a total of 623 complete mtDNAs from the Americas and Asia, including 20 new complete mtDNAs from the Americas and seven from Asia. This sequence data was used to direct high-resolution genotyping from 20 American and 26 Asian populations. Here we describe more genetic diversity within the founder population than was previously reported. The newly resolved phylogenetic structure suggests that ancestors of Native Americans paused when they reached Beringia, during which time New World founder lineages differentiated from their Asian sister-clades. This pause in movement was followed by a swift migration southward that distributed the founder types all the way to South America. The data also suggest more recent bi-directional gene flow between Siberia and the North American Arctic.

    Computing Minimal and Maximal Suffixes of a Substring Revisited
